perm filename MULTID[4,KMC]4 blob
sn#045452 filedate 1973-05-29 generic text, type T, neo UTF8
00100 MULTIDIMENSIONAL EVALUATION OF A SIMULATION
00200 OF PARANOID THOUGHT PROCESSES
00300
00400 KENNETH MARK COLBY
00500 AND
00600 FRANKLIN DENNIS HILF
00700
00800 Once a simulation model reaches a stage of intuitive
00900 adequacy, a model builder should consider using more stringent
01000 evaluation procedures relevant to the model's purposes. For example,
01100 if the model is to serve as a training device, then a simple
01200 evaluation of its pedagogic effectiveness would be sufficient. But
01300 when the model is proposed as an explanation of a psychological
01400 process, more is demanded of the evaluation procedure.
01500 We shall first give a brief description of a model of
01600 paranoid processes. A more complete account can be found in Colby,
01700 Weber, and Hilf [1]. We shall then discuss the evaluation
01800 problem which asks "how good is the model?" or "how close is the
01900 correspondence between the behavior of the model and the phenomena
02000 it is intended to explain?"
02100 (LEE-- INSERT DESCRIPTION OF MODEL HERE)
02200 Turing's test has often been suggested as a validation procedure.
02300 It is very easy to become confused about Turing's test. In
02400 part this is due to Turing himself, who introduced the now-famous
02500 imitation game in a paper entitled COMPUTING MACHINERY AND
02600 INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
02700 there are actually two imitation games, the second of which is
02800 commonly called Turing's test.
02900 In the first imitation game two groups of judges try to
03000 determine which of two interviewees is a woman. Communication between
03100 judge and interviewee is by teletype. Each judge is initially
03200 informed that one of the interviewees is a woman and one a man who
03300 will pretend to be a woman. After the interview, the judge is asked
03400 what we shall call the woman-question, i.e., which interviewee was the
03500 woman? Turing does not say what else the judge is told but one
03600 assumes the judge is NOT told that a computer is involved nor is he
03700 asked to determine which interviewee is human and which is the
03800 computer. Thus, the first group of judges would interview two
03900 interviewees: a woman, and a man pretending to be a woman.
04000 The second group of judges would be given the same initial
04100 instructions, but unbeknownst to them, the two interviewees would be
04200 a woman and a computer programmed to imitate a woman. Both groups
04300 of judges play this game until sufficient statistical data are
04400 collected to show how often the right identification is made. The
04500 crucial question then is: do the judges decide wrongly AS OFTEN when
04600 the game is played with man and woman as when it is played with a
04700 computer substituted for the man? If so, then the program is
04800 considered to have succeeded in imitating a woman as well as a man
04900 imitating a woman. For emphasis we repeat: in asking the
05000 woman-question in this game, judges are not required to identify
05100 which interviewee is human and which is machine.
05200 Later on in his paper Turing proposes a variation of the
05300 first game. In the second game one interviewee is a man and one is a
05400 computer. The judge is asked to determine which is man and which is
05500 machine, which we shall call the machine-question. It is this version
05600 of the game which is commonly thought of as Turing's test. It has
05700 often been suggested as a means of validating computer simulations of
05800 psychological processes.
05900 In the course of testing a simulation (PARRY) of paranoid
06000 linguistic behavior in a psychiatric interview, we conducted a number
06100 of Turing-like indistinguishability tests (Colby, Hilf, Weber and
06200 Kraemer, 1972). We say `Turing-like' because none of them consisted of
06300 playing the two games described above. We chose not to play these
06400 games for a number of reasons which can be summarized by saying that
06500 they do not meet modern criteria for good experimental design. In
06600 designing our tests we were primarily interested in learning more
06700 about developing the model. We did not believe the simple
06800 machine-question to be a useful one in serving the purpose of
06900 progressively increasing the credibility of the model but we
07000 investigated a variation of it to satisfy the curiosity of colleagues
07100 in artificial intelligence.
07200 In this design eight psychiatrists conducted interviews by
07300 teletype, using the technique of machine-mediated interviewing,
07400 which involves what we term "non-nonverbal" communication since
07500 non-verbal cues are made impossible (Hilf, 1972). Each judge
07600 interviewed two patients, one being PARRY and one being a hospitalized
07700 paranoid patient. The interviewers were not informed that a
07800 simulation was involved nor were they asked to identify which was the
07900 machine. Their task was to conduct a diagnostic psychiatric interview
08000 and rate each response from the `patients' along a 0-9 scale of
08100 paranoidness, 0 meaning none and 9 being highest. Transcripts of
08200 these interviews, without the ratings of the interviewers, were then
08300 utilized for various experiments in which randomly selected expert
08400 judges conducted evaluations of the interview transcripts. For
08500 example, in one experiment it was found that patients and model were
08600 indistinguishable along the dimension of paranoidness.
08700 To ask the machine-question, we sent interview transcripts,
08800 one with a patient and one with PARRY, to 100 psychiatrists randomly
08900 selected from the Directory of American Specialists and the Directory
09000 of the American Psychiatric Association. Of the 41 replies, 21 (51%)
09100 made the correct identification while 20 (49%) were wrong. Based on
09200 this random sample of 41 psychiatrists, the 95% confidence interval
09300 is between 35.9% and 66.5%, a range which includes the chance level of 50%. (Our
09400 statistical consultant was Dr. Helena C. Kraemer, Research
09500 Associate in Biostatistics, Department of Psychiatry, Stanford
09600 University.)
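The interval quoted above can be checked with a short calculation. The sketch below assumes the normal (Wald) approximation to the binomial with z = 1.96. The text does not state which method was used, but this assumption reproduces the reported figures for the 41 psychiatrists. The function name `wald_interval` is ours.

```python
import math

def wald_interval(correct, n, z=1.96):
    """Approximate 95% confidence interval for a proportion.

    Uses the normal approximation to the binomial; this is an assumed
    method, not one named in the text.
    """
    p = correct / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

# 21 correct identifications out of 41 replies, as reported in the text.
low, high = wald_interval(21, 41)
print(f"{100 * low:.1f} to {100 * high:.1f}")  # prints "35.9 to 66.5"
```

Because the interval straddles 50%, the observed proportion cannot be told apart from guessing at this sample size.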
09700 Psychiatrists are considered expert judges of patient
09800 interview behavior but they are unfamiliar with computers. Hence we
09900 conducted the same test with 100 computer scientists randomly
10000 selected from the membership list of the Association for Computing
10100 Machinery, ACM. Of the 67 replies, 32 (48%) were right and 35 (52%)
10200 were wrong. Based on this random sample of 67 computer scientists, the
10300 95% confidence interval ranges from 36% to 60%, again close to a chance level.
10400 Thus the answer to this machine-question, "can expert judges,
10500 psychiatrists and computer scientists, using teletyped transcripts
10600 of psychiatric interviews, distinguish between paranoid patients and
10700 a simulation of paranoid processes?", is "No". But what do we learn
10800 from this? It is some comfort that the answer was not "yes" and the
10900 null hypothesis (no differences) failed to be rejected, especially
11000 since statistical tests are somewhat biased in favor of rejecting the
11100 null hypothesis (Meehl,1967). Yet this answer does not tell us what
11200 we would most like to know, i.e. how to improve the model.
11300 Simulation models do not spring forth in a complete, perfect and
11400 final form; they must be gradually developed over time. Perhaps we
11500 might obtain a "yes" answer to the machine-question if we allowed a
11600 large number of expert judges to conduct the interviews themselves
11700 rather than studying transcripts of other interviewers. Such an answer would
11800 indicate that the model must be improved, but unless we systematically
11900 investigated how the judges succeeded in making the discrimination we
12000 would not know what aspects of the model to work on. The logistics of
12100 such a design are immense and obtaining a large N of judges for sound
12200 statistical inference would require an effort disproportionate to the
12300 information-yield.
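The failure to reject the null hypothesis can be illustrated with an exact two-sided binomial test against the chance level of one half. This is a sketch under our own assumptions: the text does not say which test was applied, and the function name is hypothetical. The counts are those reported above for the psychiatrists (21 of 41) and the computer scientists (32 of 67).

```python
from math import comb

def two_sided_p_at_chance(correct, n):
    """Exact two-sided binomial p-value for H0: P(correct) = 1/2.

    Sums the probability of every outcome that is no more likely than
    the observed one. With p = 1/2 each outcome i has probability
    comb(n, i) / 2**n, so the comparison can be done on integers.
    """
    observed = comb(n, correct)
    tail = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return tail / 2 ** n

for label, correct, n in [("psychiatrists", 21, 41),
                          ("computer scientists", 32, 67)]:
    p_value = two_sided_p_at_chance(correct, n)
    print(f"{label}: p = {p_value:.2f}")
```

Both p-values are far above the conventional 0.05 level, which agrees with the text: neither group's identifications depart detectably from guessing. The test says nothing about where the model falls short.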
12400 A more efficient and informative way to use Turing-like tests
12500 is to ask judges to make ordinal ratings along scaled dimensions from
12600 teletyped interviews. We shall term this approach asking the
12700 dimension-question. One can then compare scaled ratings received by
12800 the patients and by the model to precisely determine where and by how
12900 much they differ. Model builders strive for a model which
13000 shows indistinguishability along some dimensions and
13100 distinguishability along others. That is, the model converges on what
13200 it is supposed to simulate and diverges from that which it is not.
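The comparison of scaled ratings described above can be sketched as a two-sample t statistic computed separately for each dimension. The text does not specify which form of t-test was used; the unequal-variance (Welch) form is assumed here. The ratings and dimension labels below are invented placeholders for illustration only, not data from the study.

```python
import math
import statistics

def welch_t(model_ratings, patient_ratings):
    """Two-sample t statistic that does not assume equal variances."""
    mean_diff = (statistics.mean(model_ratings)
                 - statistics.mean(patient_ratings))
    standard_error = math.sqrt(
        statistics.variance(model_ratings) / len(model_ratings)
        + statistics.variance(patient_ratings) / len(patient_ratings)
    )
    return mean_diff / standard_error

# INVENTED placeholder ratings on a 0-9 scale; NOT data from the study.
ratings = {
    "dimension A": {"model": [6, 7, 5, 8, 6, 7], "patient": [3, 4, 2, 5, 3, 4]},
    "dimension B": {"model": [4, 5, 3, 6, 4, 5], "patient": [4, 5, 3, 6, 5, 4]},
}

for dimension, groups in ratings.items():
    t = welch_t(groups["model"], groups["patient"])
    print(f"{dimension}: t = {t:.2f}")
```

A large absolute t flags a dimension on which model and patients are rated differently, while a value near zero suggests indistinguishability on that dimension. Judging significance would further require the degrees of freedom and a t distribution, which this sketch omits.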
13300 We mailed paired-interview transcripts to another 400
13400 randomly selected psychiatrists asking them to rate the responses of
13500 the two `patients' along certain dimensions. The judges were divided
13600 into groups, each judge being asked to rate responses of each I-O
13700 pair in the interviews along four dimensions. The total number of
13800 dimensions in this test was twelve: linguistic noncomprehension,
13900 thought disorder, organic brain syndrome, bizarreness, anger, fear,
14000 ideas of reference, delusions, mistrust, depression, suspiciousness
14100 and mania. These are dimensions which psychiatrists commonly use in
14200 evaluating patients.
14300 Table 1 shows there were significant differences, with PARRY
14400 receiving higher scores along the dimensions of linguistic
14500 noncomprehension, thought disorder, bizarreness, anger, mistrust and
14600 suspiciousness. On the dimension of delusions the patients were rated
14700 significantly higher. There were no significant differences along the
14800 dimensions of organic brain syndrome, fear, ideas of reference,
14900 depression and mania.
15000 While tests asking the machine-question indicate
15100 indistinguishability at the gross level, a study of the finer
15200 structure of the model's behavior through ratings along scaled
15300 dimensions shows statistically significant differences between
15400 patients and model. These differences are of help to the model
15500 builder in suggesting which aspects of the model must be modified and
15600 improved in order to be considered an adequate simulation of the
15700 class of paranoid patients it is intended to simulate. For example,
15800 it is clear that PARRY's language-comprehension must be improved.
15900 Once this has been implemented, a future test will tell us whether
16000 improvement has occurred and by how much in comparison to the earlier
16100 version. Successive identification of particular areas of failure
16200 in the model permits their improvement and the development of
16300 more adequate model-versions.
16400 Further evidence that the machine-question is too coarse and
16500 insensitive a test comes from the following experiment. In this test
16600 we constructed a random version of the paranoid model which utilized
16700 PARRY's output statements but expressed them randomly no matter what
16800 the interviewer said. Two psychiatrists conducted interviews with
16900 this model, transcripts of which were paired with patient interviews
17000 and sent to 200 randomly selected psychiatrists asking both the
17100 machine-question and the dimension-question. Of the 69 replies, 34
17200 (49%) were right and 35 (51%) wrong. Based on this random sample of
17300 69 psychiatrists, the 95% confidence interval ranges from 39 to 63,
17400 again indicating a chance level. However, as shown in Table 2,
17500 significant differences appear along the dimensions of linguistic
17600 noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
17700 rated higher. On these particular dimensions we can construct a
17800 continuum in which the random version represents one extreme, the
17900 actual patients another. Our (nonrandom) PARRY lies somewhere between
18000 these two extremes, indicating that it performs significantly better
18100 than the random version but still requires improvement before being
18200 indistinguishable from patients. (See Fig. 1.) Table 3 presents t
18300 values for differences between mean ratings of PARRY and
18400 RANDOM-PARRY. (See Table 2 and Fig. 1 for the mean ratings.)
18500 The fact that even a random model can pass the machine-question test
18600 shows, not that the model is a good simulation, but that the test
18700 is weak and nonchallenging.
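The random version described above can be sketched in a few lines. This is an illustration of the idea only, not the program that was used: the output statements are placeholders (the actual statements are not reproduced in this paper), and the interviewer questions and function name are invented.

```python
import random

# Placeholder strings standing in for the model's output statements;
# the real statements are not reproduced in this paper.
OUTPUT_STATEMENTS = [
    "<output statement 1>",
    "<output statement 2>",
    "<output statement 3>",
]

def random_reply(interviewer_input, rng):
    """Return an output statement chosen at random, ignoring the input."""
    del interviewer_input  # deliberately unused: the reply never depends on it
    return rng.choice(OUTPUT_STATEMENTS)

rng = random.Random(0)  # fixed seed only so that a run is repeatable
# Invented example questions, for illustration only.
for question in ["How are you today?", "Why are you in the hospital?"]:
    print(question, "->", random_reply(question, rng))
```

Since the reply is drawn without reference to the input, any coherence a judge perceives in such an interview comes from the statements themselves and not from comprehension of the question.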
18800 Thus it can be seen that such a multidimensional evaluation
18900 provides yardsticks for measuring the adequacy of this or any other
19000 dialogue simulation model along the relevant dimensions.
19100 We conclude that when model builders want to conduct tests
19200 which indicate in which direction progress lies and to obtain a
19300 measure of whether progress is being achieved, the way to use
19400 Turing-like tests is to ask expert judges to make ratings along
19500 multiple dimensions that are essential to the model. Useful tests do
19600 not prove a model; they probe it for its strengths and weaknesses.
19700 Simply asking the machine-question yields little information relevant
19800 to what the model builder most wants to know, namely, along what
19900 dimensions must the model be improved.
20000
20100
20200 REFERENCES
20300
20400 [1] Colby, K.M., Weber, S. and Hilf, F.D., 1971. Artificial paranoia.
20500 ARTIFICIAL INTELLIGENCE, 2, 1-25.
20600
20700
20800 [2] Colby, K.M., Hilf, F.D., Weber, S. and Kraemer, H.C., 1972. Turing-like
20900 indistinguishability tests for the validation of a computer
21000 simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3,
21100 199-221.
21200
21300 [3] Hilf, F.D., 1972. Non-nonverbal communication and psychiatric research.
21400 ARCHIVES OF GENERAL PSYCHIATRY, 27, 631-635.
21500 [4] Meehl, P.E., 1967. Theory testing in psychology and physics: a
21600 methodological paradox. PHILOSOPHY OF SCIENCE, 34, 103-115.
21700
21800 [5] Turing, A., 1950. Computing machinery and intelligence. Reprinted in:
21900 COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
22000 McGraw-Hill, New York, 1963, pp. 11-35.
22100
22200
22300 ACKNOWLEDGEMENTS
22400
22500 This research is supported by Grant PHS MH 06645-12 from the National
22600 Institute of Mental Health and (in part) by Research Scientist Award
22700 (No. 1-K05-K-14,433) from the National Institute of Mental Health to
22800 the senior author.